Spatio-Temporal Scene Gist Dynamics
Abstract
Viewers can rapidly extract a holistic semantic representation of a real-world scene within a single eye fixation, an ability called recognizing the gist of a scene, and operationally defined here as recognizing an image’s basic-level scene category. However, it is unknown how scene gist recognition unfolds over both time and space — within a fixation and across the visual field. Thus, in three experiments, the current study investigated the spatiotemporal dynamics of basic-level scene categorization from central vision to peripheral vision over the time course of the critical first fixation on a novel scene. The method used a Window/Scotoma paradigm in which images were briefly presented and processing times were varied using visual masking. The results of Experiments 1 and 2 showed that during the first 100 ms of processing, there was an advantage for processing the scene category from central vision, with the relative contributions of peripheral vision increasing thereafter. Experiment 3 tested whether the above pattern could be explained by spatiotemporal changes in selective attention. The results showed that manipulating the probability of information being presented centrally or peripherally selectively maintained or eliminated the early central vision advantage. Across the three experiments, the results are consistent with a zoom-out hypothesis, in which during the first fixation on a scene, gist extraction extends from central vision to peripheral vision as covert attention expands outward. 
The Spatiotemporal Dynamics of Scene Gist Recognition

When viewing a scene image, people may describe it with such terms as “a bedroom” or “a person raking leaves.” Viewers can arrive at this semantic categorization of a real-world scene extremely rapidly, within a single eye fixation, and the theoretical construct for this process is often called “scene gist recognition” (Biederman, Rabinowitz, Glass, & Stacy, 1974; Castelhano & Henderson, 2008; Fei-Fei, Iyer, Koch, & Perona, 2007; Greene & Oliva, 2009; Larson & Loschky, 2009; Loschky & Larson, 2010; Malcolm, Nuthmann, & Schyns, 2011; Oliva & Torralba, 2006; Potter, 1976; Rousselet, Joubert, & Fabre-Thorpe, 2005). The scene gist construct is important for theories of scene perception because recognizing the gist of a scene affects later theoretically important processes such as attentional selection (Eckstein, Drescher, & Shimozaki, 2006; Gordon, 2004; Torralba, Oliva, Castelhano, & Henderson, 2006), object recognition (Bar & Ullman, 1996; Biederman, Mezzanotte, & Rabinowitz, 1982; Boyce & Pollatsek, 1992; Davenport & Potter, 2004; but see Hollingworth & Henderson, 1998), and long-term memory for scenes (Brewer & Treyens, 1981; Pezdek, Whetstone, Reynolds, Askari, & Dougherty, 1989).

Scene gist recognition has been operationalized in numerous ways, but usually in terms of the ability to classify a briefly flashed scene image at some level of abstraction, from the highly specific (e.g., “a baby reaching for a butterfly”) (Fei-Fei, et al., 2007; Intraub, 1981; Potter, 1976, pp. 509-510), to the basic level scene category (e.g., front yard) (Loschky, Hansen, Sethi, & Pydimari, 2010; Malcolm, et al., 2011; Oliva & Schyns, 2000; Renninger & Malik, 2004; Rousselet, et al., 2005), to the superordinate level scene category (e.g., natural) (Goffaux et al., 2005; Greene & Oliva, 2009; Joubert, Rousselet, Fize, & Fabre-Thorpe, 2007; Loschky & Larson, 2010), to whether the scene contains an animal (Bacon-Mace, Mace, Fabre-Thorpe, & Thorpe, 2005; Evans & Treisman, 2005; Fei-Fei, VanRullen, Koch, & Perona, 2005; Kirchner & Thorpe, 2006; Rousselet, Fabre-Thorpe, & Thorpe, 2002), to the scene’s emotional valence (e.g., positive) (Calvo, 2005; Calvo, Nummenmaa, & Hyona, 2008). Thus, the theoretical construct of scene gist has been operationalized in many different ways regarding the semantic information that viewers acquire from a scene. Like many previous studies, the current study operationally defines scene gist recognition as viewers’ ability to accurately categorize real-world scenes at the basic level (e.g., Joubert, et al., 2007; Loschky, et al., 2010; Loschky & Larson, 2010; Loschky et al., 2007; Malcolm, et al., 2011; McCotter, Gosselin, Sowden, & Schyns, 2005; Oliva & Schyns, 2000; Rousselet, et al., 2005).

As noted above, a fundamental constant in the scene gist construct is that the specified semantic information is acquired within a single fixation. The fact that scene gist acquisition occurs within a single eye fixation has been shown by eye movement research in which the very first eye movement in a visual search task, which immediately follows the first fixation (typically placed at the center of the image prior to onset of the image), generally goes directly to an expected location based on the semantic category of the scene.
For example, it has been shown that when the search target was “chimney,” the first saccade on a scene often went directly to the roof of a house, even when there was no chimney in the picture (Eckstein, et al., 2006; see also Torralba, et al., 2006). Other studies have strongly suggested that gist recognition occurs within a single fixation by presenting extremely briefly flashed and backward masked scene images (to control processing time) and asking viewers to categorize the scenes at the basic level. These studies have shown asymptotic basic level scene categorization performance at stimulus onset asynchronies (SOAs) of 100 ms (Biederman, et al., 1974; Potter, 1976) and an inflection point after SOAs as little as 35-50 ms (Bacon-Mace, et al., 2005; Loschky, et al., 2010; Loschky & Larson, 2010; Loschky, Larson, Smerchek, & Finan, 2008; Loschky, et al., 2007), which is far less than the average duration of a fixation during scene viewing of 330 ms (Rayner, 1998). Thus, in order to build on the findings from the latter studies, the current study investigates the time course of scene gist processing by varying processing time over a wide range of masking SOAs whose maximum is roughly equal to a single fixation.

While much is known about how humans process and represent the gist of scenes, a key unknown is how scene gist recognition unfolds over both time and space, that is, within a single eye fixation and across the visual field. Specifically, because viewers acquire scene gist within the temporal window of a single fixation, any given piece of visual information in a scene has a fixed retinal eccentricity during that critical first fixation. Thus, because retinal eccentricity greatly affects visual processing (for review, see Strasburger, Rentschler, & Juttner, 2011; Wilson, Levi, Maffei, Rovamo, & DeValois, 1990), the retinal eccentricity of scene information must play a key role in scene gist acquisition (Larson & Loschky, 2009).
However, this raises the novel question addressed in this study: Does the spatial variability in processing across the visual field undergo important changes over the time course of the critical first fixation on a scene? Interestingly, three plausible alternative hypotheses regarding such spatiotemporal variability in scene gist processing are suggested by the existing literature.

The first hypothesis stems from the well-known differences in the speed of information transmission between central and peripheral vision from the retina to the brain, with peripheral visual information reaching the lateral geniculate nucleus (LGN) of the thalamus and the primary visual cortex (V1) before central visual information (Nowak, Munk, Girard, & Bullier, 1995; Schmolesky et al., 1998). Since peripheral vision is particularly important for scene gist recognition (Larson & Loschky, 2009), it is plausible that this temporal processing advantage for peripheral vision may underlie the incredible speed of gist recognition (Calvo, et al., 2008; Girard & Koenig-Robert, 2011).

A second plausible alternative hypothesis is based on eye movement and attention research, which has shown that covert attention starts centrally in foveal vision at the start of each fixation, and, over time, extends out to the visual periphery (Henderson, 1992; White, Rolfs, & Carrasco, in press). This central-to-peripheral spatiotemporal order of visual processing could extend to the process of scene gist recognition during the critical first fixation on a scene, producing an advantage for central vision early in the first fixation.
Finally, a third plausible alternative hypothesis is based on the idea that the rapid extraction of scene gist within a single fixation occurs in the near absence of attention (Fei-Fei, et al., 2005; Li, VanRullen, Koch, & Perona, 2002; Otsuka & Kawaguchi, 2007; Rousselet, et al., 2002) and in parallel across the field of view (Rousselet, Fabre-Thorpe, & Thorpe, 2002). In that case, neither central nor peripheral vision would be expected to play a larger role early or late in processing scene gist, but instead would be assumed to play equivalent roles throughout the first fixation.

These three plausible alternative hypotheses cover three logical possibilities: 1) an early advantage for peripheral over central vision, 2) the reverse, namely an early advantage for central over peripheral vision, or 3) no advantage for either central or peripheral vision. Below, we describe the research supporting each of these alternative hypotheses in greater detail.

Central Versus Peripheral Vision & the Spatiotemporal Dynamics of Scene Gist Recognition

The visual field can be roughly divided into two mutually exclusive regions, central and peripheral vision. Central vision, which includes both foveal and parafoveal vision, is contained within a roughly 5° radius of fixation (Osterberg, 1935; cited in Strasburger, et al., 2011, p. 3). We follow standard convention in studies of visual cognition by defining peripheral vision as the remainder of the visual field beyond central vision’s 5° radius (e.g., Hollingworth, Schrock, & Henderson, 2001; Holmes, Cohen, Haith, & Morrison, 1977; Rayner, Inhoff, Morrison, Slowiaczek, & Bertera, 1981; Shimozaki, Chen, Abbey, & Eckstein, 2007; van Diepen & Wampers, 1998).
Footnote 1. Although foveal and parafoveal vision are both referred to as central vision, there are important anatomical and perceptual differences between them. The fovea contains the greatest concentration of cones (Curcio, Sloan, Packer, Hendrickson, & Kalina, 1987), whereas the parafovea has the largest concentration of rods (Curcio, Sloan, Kalina, & Hendrickson, 1990). Likewise, visual acuity is greater in the fovea than the parafovea (Westheimer, 1982; Wilson, et al., 1990).

Our first of three alternative hypotheses regarding the spatiotemporal dynamics of scene gist acquisition is related to the fact that the vast majority of information in real-world scenes is contained within peripheral vision, which has lower spatial resolution, but a finer temporal resolution and faster information transmission to visual cortex than central vision (Livingstone & Hubel, 1988; Nowak, et al., 1995; Strasburger, et al., 2011; Wilson, et al., 1990). Neurophysiological studies of macaques have shown that visual information transmitted by the magnocellular retinal ganglion cells reaches the LGN and V1 ~20 ms faster than information transmitted by the parvocellular retinal ganglion cells (Nowak, et al., 1995; Schmolesky, et al., 1998). This advantage has been estimated to be substantially larger (90 ms) for peripheral vision in humans, as shown for discrimination of Gabor orientation in central vision (4° eccentricity) versus peripheral vision (10° eccentricity) (Carrasco, McElree, Denisova, & Giordano, 2003). The visual transmission advantage for peripheral vision could be critical for processing real-world scene images, including the recognition of a scene’s basic level category, especially at the early stages of scene processing.

Recent studies have shown the importance of peripheral vision for processing real-world scene images.
Larson and Loschky (2009) showed that the central 5° of an image could be completely removed from a scene with no decrease in basic level scene categorization performance. Conversely, presenting only the central 5° of a scene, while blocking scene information beyond that, produced worse categorization performance than when the entire scene image was presented. Similarly, Boucart and colleagues (Boucart, Moroni, Thibaut, Szaffarczyk, & Greene, 2013) showed the usefulness of peripheral vision for scene categorization by presenting scene images to the left and right of fixation and having viewers indicate the side with the target category. Performance was good (73% accuracy), even for scenes presented at up to 70° eccentricity. Similar results have been shown for animal detection in scenes using far peripheral vision (Thorpe, Gegenfurtner, Fabre-Thorpe, & Bulthoff, 2001). The above results show that peripheral vision conveys critical information for basic level scene categorization despite its low spatial resolution. Given that information from peripheral vision is transmitted to the LGN and V1 faster than information presented in central vision, this could produce better scene gist recognition in peripheral vision than central vision at the earliest stages of scene processing (Calvo, et al., 2008; Girard & Koenig-Robert, 2011).

Interestingly, a separate body of literature on attention in scenes suggests a second alternative hypothesis regarding the spatiotemporal processing of scene gist in the first fixation. Henderson (1992) has argued for the “sequential attention model” in which attention starts in central vision for each eye fixation and is later sent to the target of the next saccade in the visual periphery towards the end of the fixation.
Research on reading processes has shown evidence consistent with this hypothesis (Rayner, et al., 1981; Rayner, Liversedge, & White, 2006; Rayner, Liversedge, White, & Vergilino-Perez, 2003), and later research has shown similar findings for visual search in scenes and scene memory (Glaholt, Rayner, & Reingold, 2012; Rayner, Smith, Malcolm, & Henderson, 2009; van Diepen & d’Ydewalle, 2003). For example, van Diepen and d’Ydewalle (2003) found that in a “non-object” search task, masking foveal information early in a fixation was more detrimental than masking peripheral information. This deleterious effect was observed with both the search task and eye movement measures, suggesting that at the beginning of each fixation, information from the center of vision was processed first, followed by the information contained in the visual periphery. However, van Diepen and d’Ydewalle argued that their findings might not apply to other tasks:

    Given the task demands in the present experiments, objects of moderate size were of primary importance, and had to be inspected foveally. Conceivably, in other tasks a much larger part of the stimulus is processed at the beginning of fixations (e.g., when the scene identity has to be determined) [emphasis added]. Obviously, in the latter tasks the area that will affect fixation durations can be expected to be much larger than just the foveal stimulus. (2003, p. 97)

That is, while there is evidence of the sequential allocation of attention from central to peripheral vision during extended visual search in scenes, it may not apply to the rapid acquisition of scene gist at the beginning of the first fixation on a scene, when peripheral vision may be more important.

A more recent study by Glaholt et al. (2012) has further tested the sequential attention model with real-world scenes in both visual search and scene recognition memory tasks.
In that study, either central vision or the entire image was masked, gaze-contingently, after varying SOAs within the first 100 ms of each fixation (Glaholt, et al., 2012). That study showed that loss of central vision (with a 3.71° radius mask) only affected early processing. For scene recognition memory, a central mask with a 0 ms masking SOA reduced performance, but a 50 ms SOA did not. For visual search, a central mask with a 50 ms masking SOA degraded foveal target identification, but a 100 ms SOA did not. In contrast, masking the entire image disrupted both visual search and scene recognition memory as late as a 75 ms masking SOA. Thus, while central vision was only important early in a fixation, peripheral vision remained important until later in fixations for both visual search and scene recognition memory. These results are consistent with the sequential attention model, but do not speak to van Diepen and d’Ydewalle’s (2003) argument that the foveal-to-peripheral sequential attention model may not apply to scene gist recognition on the first fixation on a scene. Thus, this remains an important untested hypothesis regarding the spatiotemporal dynamics of scene gist acquisition.

An interesting question is whether the sequential attention model can be interpreted as a spatio-temporally constrained version of the zoom-lens model (Eriksen & St. James, 1986; Eriksen & Yeh, 1985) in which covert attention zooms out on each fixation. Regardless, if the sequential attention model indeed applies to scene gist recognition on the first fixation on a scene (contrary to van Diepen & d’Ydewalle’s suggestion), then it suggests that basic level scene categorization should be best in central vision at the early stages of a fixation, while performance should converge at later processing times.
Finally, a third alternative hypothesis regarding the spatiotemporal dynamics of scene gist acquisition is suggested by research on the role of attention in scene gist acquisition. This research questions whether attentional processes, here the spatiotemporal dynamics of visual attention across the visual field, underlie scene gist recognition, or whether preattentive processes, which work across the entire field of view in parallel, underlie scene gist recognition. Several recent studies have suggested that scene categorization requires little, if any, attentional resources (Fei-Fei, et al., 2005; Li, et al., 2002; Otsuka & Kawaguchi, 2007; Rousselet, et al., 2002), while other studies suggest that attention may yet play a role in obtaining meaningful scene information (Cohen, Alvarez, & Nakayama, 2011; Evans & Treisman, 2005; Walker, Stafford, & Davis, 2008). If scene gist recognition requires attention and is affected by attentional processes, then the spatiotemporal dynamics of attention during the initial fixation on a scene should produce differences in basic level scene categorization between central and peripheral vision at early processing times. Conversely, if scene gist acquisition is an attention-free process, based primarily on preattentive processes operating in parallel across the entire field of view (Fei-Fei, et al., 2005; Li, et al., 2002; Otsuka & Kawaguchi, 2007; Rousselet, et al., 2002), then no differences should be found between the utility of central versus peripheral vision for scene gist acquisition over the time course of the critical first fixation. This constitutes a well-founded and plausible null hypothesis regarding the role of the spatiotemporal dynamics of visual attention in scene gist acquisition.
In sum, the goals of the current study were to determine 1) whether there is any difference in the utility of central versus peripheral vision in acquiring the basic level category of a scene over the time course of a single fixation, and 2) if such differences exist, whether they are more consistent with the idea that peripheral vision is processed most quickly and thus dominates early scene categorization, or whether processing expands from central vision outward over the course of a single fixation consistent with the sequential attention model.

General Method of the Study

The current study used a “Window” and “Scotoma” paradigm (see Figure 1) to evaluate the relative contributions of central versus peripheral vision to scene gist recognition over time (Larson & Loschky, 2009). We define a “Window” as a circular viewable region encompassing the central portion of a scene, while blocking the more eccentric peripheral information (McConkie & Rayner, 1975; van Diepen, Wampers, & d’Ydewalle, 1998). Conversely, a “Scotoma” blocks out the central portion of a scene and shows only the peripheral information (Rayner & Bertera, 1979; van Diepen, et al., 1998). An inherent difficulty in such a method is that any difference in scene categorization could potentially be explained simply in terms of a difference in the amount of viewable information available in each condition. For example, if it were shown that peripheral vision had an advantage over central vision, one could argue that this was because peripheral vision had more information. Conversely, if central vision showed an advantage, this advantage could similarly be argued to be due to cortical magnification of foveal and parafoveal information.
Thus, to control for such potentially confounding spatial attributes inherent to central and peripheral vision, it is first necessary to determine what we call the “critical radius,” that is, the radius that perfectly divides the central and peripheral regions of a scene into two mutually exclusive regions, each of which produces equivalent scene categorization performance when given unlimited processing time within a single fixation (i.e., when images are unmasked, and thus sensory memory generally lasts until a saccade is made) (Larson & Loschky, 2009). Given a critical radius for which unlimited processing time in a single fixation produces equal performance for information presented in both Window and Scotoma conditions, we can then ask whether limiting processing time produces any difference in scene categorization performance between those Window and Scotoma conditions based on the critical radius. It is important to note from the outset that such a research strategy is highly conservative, with a strong bias towards finding no difference between the Window and Scotoma conditions, given that the critical radius is defined in terms of producing equivalent performance between the two conditions (when there is no masking). Therefore, if variations in processing time do produce differences even when using the critical radius, then we can be confident that those differences do not stem from an imbalance in the amount of image content provided within the Window and Scotoma conditions respectively. It is for this reason that we took the conservative strategy of using the critical radius to balance the functional value of viewable imagery in the Window and Scotoma conditions.
[[Insert Figure 1 about here]]

General Hypotheses

If the greater neural transmission speed of peripheral vision influences early differences in scene gist acquisition, then we would expect a scene categorization performance advantage for the Scotoma condition over the Window condition at early processing times. Conversely, if the sequential attention model applies to acquiring scene gist over the course of a single fixation, then we would expect a scene categorization performance advantage for the Window condition at early processing times compared to the Scotoma condition. Finally, if scene gist acquisition is a largely parallel and preattentive process, then Window and Scotoma conditions should be equally useful for scene categorization throughout the critical first fixation, and varying the processing times for Window and Scotoma conditions should produce no advantage for either condition, whether early or late in processing.

We conducted three experiments to explore the spatiotemporal dynamics of scene gist acquisition in a single fixation and their relationship to visual attention. Across the three experiments, our data suggest that at the beginning of a fixation, attention is allocated to central vision. However, within the first 100 ms of scene processing, attention expands to encompass peripheral areas of the visual field. These findings are consistent with a combination of the sequential attention model and a spatiotemporally constrained interpretation of the zoom lens model of attention that we call the zoom-out hypothesis. These novel results, and the theoretical advance they provide, place fundamental spatiotemporal constraints on any theory of the processes involved in scene gist acquisition.

Experiment 1

Experiment 1 was designed to investigate the relative utility of information in central versus peripheral vision for scene gist acquisition over the time course of a single fixation.
We used a Window/Scotoma paradigm to selectively present scene information to either central or peripheral vision, respectively, together with visual masking to vary processing time. This enabled us to test which of our three competing hypotheses (the transmission speed advantage for peripheral vision, the sequential attention model, or parallel and preattentive processes across the visual field) would best explain the spatiotemporal dynamics of scene gist recognition. Our Window and Scotoma stimuli were constructed using a critical radius that produced equivalent basic level scene categorization in both the Window and Scotoma conditions when the stimuli were unmasked, in order to functionally equalize the viewable information presented in the Window and Scotoma conditions. By using the critical radius, however, we greatly reduced the chances of rejecting the null hypothesis when comparing the Window and Scotoma conditions. Thus, even relatively small differences found between the two conditions as a function of SOA would indicate differences in the spatiotemporal dynamics of scene gist recognition.

Method

Participants. There were 56 participants (33 female), whose ages ranged from 18 to 32 years (M = 19.59, SD = 2.02). All had normal or corrected-to-normal vision (20/30 or better), gave their Institutional Review Board-approved informed written consent, and received course credit for participating.

Design. The experiment used a 2 (Window vs. Scotoma) x 6 (processing time) within-subjects design. There were 28 practice trials followed by 240 recorded trials.

Stimuli. Window and Scotoma stimuli were created from circularly cropped scene images having a diameter of 21.9° (i.e., a maximal radius/retinal eccentricity of 10.95°) at a viewing distance of 63.5 cm, using a forehead and chin rest.
We interpolated the size of the critical radius based on the prior results of Larson and Loschky (2009) and confirmed through pilot testing that a critical radius of 5.54° (170 pixel radius) produced equal performance in both the Window and Scotoma conditions when stimuli were presented for 24 ms unmasked. Window images presented 25.6% of the viewable scene area inside the critical radius, while 74.4% of the viewable scene area was presented outside the critical radius in the Scotoma condition. We used a total of 240 images, comprising ten scene categories (5 natural: beach, desert, forest, mountain, river; 5 man-made: farm, home, market, pool, street). The 240 scene images were randomly assigned to the viewing conditions, and each scene was presented only once. The circular scene stimuli were presented on a 17” ViewSonic Graphics Series CRT monitor (Model G90fb).

Masks were scene texture images generated using the Portilla and Simoncelli algorithm (2000). These types of masks have been shown to be highly effective at disrupting scene gist processing because they contain second-order and higher-order image statistics similar to real-world scenes but do not contain any recognizable information (Loschky, et al., 2010). Masks were identical in shape and size to the stimuli they masked: Window stimuli were masked by Window masks, and Scotoma stimuli were masked by Scotoma masks (see Figure 2). This was done to avoid metacontrast masking, which tends to produce type B (U-shaped) masking functions (Breitmeyer & Ogmen, 2006; Francis & Cho, 2008). All images, including targets and masks, were equalized in terms of their mean luminance and RMS contrast (see Loschky, et al., 2007 for details on equalizing mean luminance and RMS contrast). Scene information contained outside the Window, or inside the Scotoma, was replaced by neutral gray equal to the mean luminance value of our stimuli.
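The reported area split follows from the two radii given for the stimuli. The minimal sketch below reproduces it, assuming that visual angle is treated as proportional to on-screen distance (so that area scales with the square of the radius in degrees); the function name is hypothetical and is not taken from the authors' materials.

```python
def area_fractions(critical_radius_deg, max_radius_deg):
    """Split a circular image into Window and Scotoma area fractions.

    Assumes area scales with the square of the radius expressed in
    degrees of visual angle (a small-angle simplification).
    """
    window = (critical_radius_deg / max_radius_deg) ** 2
    return window, 1.0 - window


# Radii reported in the Stimuli section: 5.54 deg critical radius,
# 10.95 deg maximal radius of the circular image.
window_frac, scotoma_frac = area_fractions(5.54, 10.95)
print(f"Window: {window_frac:.1%}  Scotoma: {scotoma_frac:.1%}")
```

Under this assumption the Window covers about a quarter of the image area, which is why equal unmasked performance at the critical radius reflects functional rather than areal equivalence.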
The same gray value was used for the blank screens and the background of the fixation point and category label.

[[Insert Figure 2 about here]]

Procedures. After completing a preliminary visual near acuity test (using a Snellen chart), participants were seated in front of a computer monitor. Participants were first familiarized with the 10 scene categories by showing them four sample images from each category together with their respective category labels. Participants then completed 30 unrecorded practice trials before completing the 240 experimental trials. The sample and practice stimuli were not used again in the main experiment.

Figure 2 shows a schematic of a trial in each of the two viewing conditions (Window vs. Scotoma). We used an EyeLink 1000 remote eyetracking system with a forehead and chin rest to maintain a constant viewing distance. The eyetracker was programmed with a “fixation failsafe” algorithm to ensure that participants were fixated in the center of the screen. If the participant was not fixated within a 1° x 1° bounding box at the center of the image when they pressed the gamepad button to initiate a trial, the trial was recycled and did not initiate. Thus, the participant was always fixating the center of the screen when they initiated a trial. After a 48 ms delay, the target image was flashed for 24 ms, and following the prescribed interstimulus interval (ISI) of 0, 71, 165, 259 or 353 ms (which produced the target-to-mask SOAs of 24, 94, 188, 282, or 376 ms, or a no-mask condition), the mask was presented for 24 ms. It was predicted that the longest masking SOA, which is only slightly longer than the average fixation duration on scene images (330 ms), would be equivalent to the no-mask condition. This is because the retinal image in the no-mask condition would be masked by a saccade (Irwin, 1992) after, on average, 330 ms.
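The ISIs and SOAs listed above are consistent with whole numbers of video frames at the 85 Hz refresh rate reported for the monitor. The sketch below shows that arithmetic; the frame counts (a 2-frame target and ISIs of 0, 6, 14, 22, and 30 frames) are an inference from the reported millisecond values, not numbers stated by the authors.

```python
REFRESH_HZ = 85                 # monitor refresh rate reported in the paper
FRAME_MS = 1000 / REFRESH_HZ    # about 11.76 ms per frame


def frames_to_ms(n_frames):
    """Convert a frame count to a duration rounded to the nearest ms."""
    return round(n_frames * FRAME_MS)


# Assumed frame counts, inferred from the reported durations.
TARGET_FRAMES = 2
ISI_FRAMES = [0, 6, 14, 22, 30]

isis_ms = [frames_to_ms(n) for n in ISI_FRAMES]
# SOA = target duration + ISI, summed in frames before rounding.
soas_ms = [frames_to_ms(TARGET_FRAMES + n) for n in ISI_FRAMES]

print("ISIs (ms):", isis_ms)
print("SOAs (ms):", soas_ms)
```

Summing in frames before rounding matters: adding the rounded 24 ms target duration to a rounded ISI of 71 ms would give 95 ms rather than the reported 94 ms.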
The shortest SOA was based on the shortest stimulus duration that produced above-chance performance with this task and stimuli based on pilot testing. The other SOAs were chosen to provide roughly equal steps between the two extremes. Following the mask presentation, there was a 750 ms blank screen, and then a category label was presented until the participant responded using a handheld gamepad. The category label was a valid description of the scene on 50% of trials (and invalid on the other 50%). If the label was valid, participants were instructed to press the “yes” button on their gamepad, and otherwise to press the “no” button. The presentation of Window and Scotoma scene image conditions was randomized throughout the experiment. All scene image categories appeared equally often, in random order. All category labels appeared equally often, and invalid labels were randomly selected from the remaining nine categories without replacement.

Footnote 2. The monitor’s 85 Hz refresh rate allowed images to be presented for multiples of 11.76 ms. SOAs were calculated based on this refresh rate, with the reported SOAs rounded to the nearest whole number.

Results

Precursors. Due to poor task performance, we eliminated data from participants whose average accuracy was at or below the 5th percentile (< 51.98% correct, 2 participants). The fixation data for each participant were then filtered spatially and temporally to ensure that the point of fixation was within the 1° x 1° bounding box at the center of the image for the entire time period from the onset of the target to the offset of the mask, and that there was only a single eye fixation during this time period. Any trials that did not meet these criteria were discarded. Overall, 17.1% of the trials were removed from the analysis, resulting in a total of 11,337 trials that satisfied the experimental constraints.
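The spatial and temporal fixation filter described above can be sketched as follows. This is only an illustration under stated assumptions: the `GazeSample` record, the `trial_passes` function, and the example gaze traces are all hypothetical and do not reflect the authors' actual analysis code or the EyeLink data format.

```python
from dataclasses import dataclass


@dataclass
class GazeSample:
    t_ms: float   # time relative to target onset, in ms
    x_deg: float  # horizontal gaze offset from image center, in degrees
    y_deg: float  # vertical gaze offset from image center, in degrees


def trial_passes(samples, n_fixations, soa_ms, mask_ms=24.0, box_deg=1.0):
    """Keep a masked trial only if there was a single fixation and gaze
    stayed inside the central box from target onset to mask offset."""
    if n_fixations != 1:
        return False
    half = box_deg / 2.0
    end_ms = soa_ms + mask_ms  # mask offset = SOA + mask duration
    in_interval = [s for s in samples if 0.0 <= s.t_ms <= end_ms]
    if not in_interval:
        return False
    return all(abs(s.x_deg) <= half and abs(s.y_deg) <= half
               for s in in_interval)


# Hypothetical traces: one steady, one drifting outside the box at 100 ms.
steady = [GazeSample(float(t), 0.1, -0.2) for t in range(0, 120, 4)]
drifted = steady + [GazeSample(100.0, 0.9, 0.0)]

print(trial_passes(steady, 1, soa_ms=94))
print(trial_passes(drifted, 1, soa_ms=94))
```

Because the checked interval grows with the SOA, longer SOAs give more opportunity for a disqualifying eye movement, which matches the higher exclusion rates reported for the longest SOAs.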
A greater proportion of trials was eliminated from the 376 ms SOA (22% of these trials were removed; 1,654 trials satisfied the experimental constraints) and the 282 ms SOA (16% of the trials were removed; 1,799 trials satisfied the experimental constraints) than from the remaining SOAs (< 11% of the trials were removed; 1,919-1,993 trials satisfied the experimental constraints per SOA condition). This was because viewers were more likely to spontaneously make an eye movement within the 282 ms and 376 ms SOA conditions than at the shorter SOAs (≤ 188 ms SOA).

Main analyses. As assumed by use of the critical radius, basic-level scene categorization performance did not differ between the Window and Scotoma conditions in the no-mask condition, t(53) = 0.57, p = .57, Cohen’s d = .07. This equivalence was supported by calculating the reciprocal of the JZS Bayes factor (= 0.125) from the t-value, which showed substantial evidence in favor of the null (Wetzels et al., 2011a). Thus, crucially for our method, the critical radius produced equal performance in the Window and Scotoma conditions when processing time lasted for a single eye fixation (i.e., when there was no mask, and thus the subject’s next eye movement masked their retinal image of the scene). The remaining data were analyzed with a 2 (viewing condition [Window vs. Scotoma]) x 5 (SOA [24, 94, 188, 282, and 376 ms]) within-subjects factorial ANOVA, and a trend analysis was performed to determine whether the psychophysical functions differed between the two viewing conditions over time. Trend analyses are only reported for interval-scaled independent variables with more than two levels. The results of Window and Scotoma performance across SOAs are shown in Figure 3.
As expected, scene categorization performance increased with processing time, as shown by a large and significant main effect of processing time, F(4, 212) = 77.40, p < .001, Cohen’s f = .752, as well as a significant linear trend, F(1, 53) = 257.16, p < .001. There was no main effect detected in the Window vs. Scotoma comparison, F(1, 53) = 1.53, p = .22, Cohen’s f = .031. Our chief interest was whether there would be a significant interaction between the Window/Scotoma viewing conditions and processing time. As shown in Figure 3, there was a significant interaction, F(4, 212) = 4.05, p = .003, Cohen’s f = .150, such that there was an advantage for the Window condition over the Scotoma condition, but only at the shortest SOA. This was verified with multiple Bonferroni-corrected t-tests (critical p-value = .01), which compared scene categorization performance between the Window and Scotoma conditions at each SOA (24 ms: t(53) = 3.19, p = .002, Cohen’s d = 0.45; all other ts < 1.82, ps > .07). This is consistent with the hypothesis that processing of scene gist begins in central vision and expands outward over time.

Footnote 3. The JZS Bayes Factor can be used to determine the degree of support for the null hypothesis (Rouder, Speckman, Sun, Morey, & Iverson, 2009). Wetzels et al. (2011b, Table 1) gives interpretations of 1/(JZS Bayes Factor) as calculated by Rouder et al. (2007). The null hypothesis is stated below as H0, and the alternative hypothesis is stated as HA. Those values and interpretations are provided below:

1/Bayes factor    Interpretation
> 100             Decisive evidence for HA
30–100            Very strong evidence for HA
10–30             Strong evidence for HA
3–10              Substantial evidence for HA
1–3               Anecdotal evidence for HA
1                 No evidence
1/3–1             Anecdotal evidence for H0
1/10–1/3          Substantial evidence for H0
1/30–1/10         Strong evidence for H0
1/100–1/30        Very strong evidence for H0
< 1/100           Decisive evidence for H0
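The interpretation bands in the footnote table can be expressed as a small lookup, which makes it easy to confirm how a given reciprocal Bayes factor is labeled. This is a sketch under stated assumptions: the function name is hypothetical, and because the table does not say which band a value exactly on a boundary (e.g., exactly 3) belongs to, the code assigns boundary values to the weaker band.

```python
# Bands for values above 1, taken from the footnote table (lower bound, label).
_BANDS = [
    (100.0, "Decisive evidence for HA"),
    (30.0, "Very strong evidence for HA"),
    (10.0, "Strong evidence for HA"),
    (3.0, "Substantial evidence for HA"),
    (1.0, "Anecdotal evidence for HA"),
]


def interpret_reciprocal_bf(value: float) -> str:
    """Label a 1/(JZS Bayes factor) value using the table's bands.

    Assumption: values exactly on a boundary fall into the weaker band.
    """
    if value <= 0:
        raise ValueError("a Bayes factor must be positive")
    if value == 1.0:
        return "No evidence"
    # The table is symmetric: values below 1 mirror the bands above 1,
    # with H0 in place of HA.
    favors_h0 = value < 1.0
    magnitude = 1.0 / value if favors_h0 else value
    for lower, label in _BANDS:
        if magnitude > lower:
            return label.replace("HA", "H0") if favors_h0 else label
    raise AssertionError("unreachable for positive values other than 1")


# The reciprocal Bayes factor reported for the no-mask comparison.
print(interpret_reciprocal_bf(0.125))  # Substantial evidence for H0
```

The value 0.125 lies between 1/10 and 1/3, which agrees with the paper's description of "substantial evidence in favor of the null."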
As predicted based on our use of the critical radius stimuli, the longest SOA (376 ms) produced statistically equivalent performance in the Window (M = 82.1%, SD = .10) and Scotoma conditions (M = 82.6%, SD = .12), t(53) = 0.34, p = .74, Cohen’s d = 0.04, JZS Bayes Factor = 8.86. The longest SOA (M = 82.4%, SD = .09) also produced results equivalent to the no-mask condition (M = 82.8%, SD = .07), t(53) = 0.33, p = .74, Cohen’s d = 0.03, JZS Bayes Factor = 8.89, which was necessarily predicted by our assumption that a 376 ms masking SOA is equivalent to the complete processing time for a single fixation afforded by the no-mask condition.

Footnote 4. Cohen’s f magnitudes for small, medium, and large effect sizes are generally given as .10, .25, and .40, respectively (Cohen, 1988).

Footnote 5. The trend analyses confirmed the results of the ANOVA by showing that the interaction was significant as a linear trend, F(1, 53) = 7.09, p = .010. As this suggests, and as seen in Figure 3, as processing time increased, performance increased at different rates for the Window and Scotoma scene images.

[[Insert Figure 3 about here]]

Discussion

The results of Experiment 1 showed that the Window condition produced moderately and significantly better scene categorization performance than the Scotoma condition at the earliest processing time (24 ms SOA). This advantage was gone by 94 ms SOA, and thereafter processing was equivalent between the two conditions. These results are inconsistent with our hypothesis based on the peripheral vision transmission speed advantage, which predicted that the Scotoma condition would be better than the Window condition at the early stages of processing. Likewise, these results are inconsistent with the hypothesis that the basic-level scene category can be recognized in the near absence of attention in parallel across the field of view (Fei-Fei, et al., 2005; Li, et al., 2002; Otsuka & Kawaguchi, 2007; Rousselet, et al., 2002).
However, contrary to the suggestion of van Diepen and d’Ydewalle (2003), these results are consistent with the sequential attention model (Henderson, 1992, 1993) as applied to scene gist recognition. This assumes that attention starts at the point of fixation at the beginning of a fixation, and later extends to the next saccade target in the visual periphery. The data are also consistent with a combination of a) the sequential attention model and b) a spatiotemporally constrained version of the zoom lens model of attention (Eriksen & St. James, 1986; Eriksen & Yeh, 1985) in which attention zooms out from the center of vision over the course of a fixation, which we call the zoom-out hypothesis. Thus, at the beginning of the stimulus presentation (24 ms SOA), participants were more accurate when information was presented at the center of vision than if it was presented in the visual periphery (beyond 5.54° eccentricity in the Scotoma condition), suggesting that attention was focused near the center of vision, where there was information in the Window condition, but not the Scotoma condition. However, after an additional 70 ms processing time (94 ms SOA), information presented either centrally or peripherally produced equal performance, suggesting that by that time attention had expanded to encompass the entire image in both the Window and Scotoma conditions. Thereafter, performance monotonically increased at an equivalent rate in both conditions as processing time increased further. It is worth noting that asymptotic scene categorization performance was not reached until roughly 282 ms SOA, which is considerably longer than in most studies assessing basic level scene categorization using visual masking to control processing time.
A simple explanation for this is that both the Window and Scotoma conditions were missing scene information that would otherwise be present in a whole scene image, and thus extra processing time was required to reach asymptotic performance. The fact that a whole image condition requires less processing time points to the importance of processing across the entire visual field (or at least the entire image), and thus the advantage for central information early in a fixation should be considered in relative rather than absolute terms.

Experiment 2

Experiment 1 assessed the spatiotemporal dynamics of scene gist acquisition, operationalized in terms of basic level scene categorization, between 24 and 376 ms processing time. However, the data from Experiment 1 showed that the important scene processing differences between central and peripheral vision were in the first 100 ms of processing. The advantage for information presented centrally at 24 ms SOA disappeared by 94 ms and thereafter performance was similar across all SOAs up to and including 376 ms. Thus, a key question is what happens to the spatiotemporal dynamics of scene gist acquisition during the first 100 ms of viewing a scene? Specifically, when does information presented to central and peripheral vision become equally useful? Experiment 2 addressed these questions.

Method

Participants. There were 85 participants (48 females), whose ages ranged from 18-32 years old (M = 19.51, SD = 2.05). All had normal or corrected-to-normal vision (20/30 or better), gave their Institutional Review Board-approved informed written consent, and received course credit for participating.

Design. The design of Experiment 2 was the same as Experiment 1 except that the SOAs in Experiment 2 were more densely sampled from the first 100 ms of processing, when meaningful differences in Experiment 1 were observed. As in Experiment 1, target and mask images were presented for 24 ms.
However, the ISIs were 0, 12, 24, 47, 71, and 353 ms (producing SOAs of 24, 35, 47, 71, 94, and 376 ms). Because no difference was found between the 376 ms SOA and the “No-mask” condition in Experiment 1, the 376 ms SOA served the same function as a “No-mask” condition in Experiment 2 (i.e., providing a single eye fixation’s processing time). Thus, the SOAs used in Experiment 2 reflect a range of processing time from the minimum processing time necessary for above-chance performance (24 ms SOA) to the point in processing in Experiment 1 when information presented to both central and peripheral vision first produced equal performance (94 ms SOA). The other SOAs were chosen to provide roughly equal steps between the two extremes. As in Experiment 1, participants completed familiarization and practice trials before the recorded trials.

Stimuli and procedures. All stimuli and procedures were the same as in Experiment 1 except for the different SOAs.

Results

Precursors. Data cleaning procedures were the same as in Experiment 1. Due to poor task performance, we eliminated data from participants whose average accuracy was at or below the 5th percentile (< 55.96% correct, 4 participants). The fixation data for each subject were then filtered spatially and temporally to ensure that the point of fixation was within the 1° x 1° bounding box at the center of the image for the entire time period from the onset of the target to the offset of the mask, and that there was only a single eye fixation during this time period. Any trials that did not meet these criteria were discarded. This resulted in a total of 17,444 trials that satisfied the experimental conditions after removing 15.5% of the trials from the analysis.
Similar to Experiment 1, a greater proportion of data was eliminated from the 376 ms SOA (19% of the trials were removed; 2,541 trials satisfied the experimental constraints) than from the remaining SOAs (< 12% of the trials were removed per SOA condition; 2,932 to 3,011 trials satisfied the experimental constraints per SOA condition).

Main analyses. As assumed by use of the critical radius, scene categorization performance in the Window and Scotoma conditions was not different at the longest SOA (376 ms), t(80) = 0.81, p = .42, Cohen’s d = .08, with the JZS Bayes Factor (= 0.121) indicating substantial evidence in favor of the null (Wetzels, et al., 2011a). Thus, when given the equivalent of a single eye fixation to process the scenes, equal performance was found in the two viewing conditions. A 2 (viewing condition [Window vs. Scotoma]) x 5 (SOA [24, 35, 47, 71, and 94 ms]) within-subjects factorial ANOVA was used to analyze performance between the two viewing conditions over the first 100 ms of scene processing. As shown in Figure 4, as expected, basic level scene categorization increased with processing time, F(4, 320) = 7.02, p < .001, Cohen’s f = .172. Of greater interest, however, the Window condition produced better scene categorization performance than the Scotoma condition, F(1, 80) = 23.69, p < .001, Cohen’s f = .167. However, the interaction between viewing condition and processing time did not affect scene categorization accuracy, F(4, 320) = 0.27, p = .90. The lack of an interaction indicates that the advantage for the Window image condition over the Scotoma image condition was present over the entire first 100 ms of processing. This result is consistent with both reading and scene perception research showing an advantage for information presented to central vision at the beginning of a fixation (Glaholt, et al., 2012; Rayner, et al., 1981; Rayner, et al., 2006; Rayner, et al., 2003; van Diepen & d'Ydewalle, 2003).
Additionally, this pattern is consistent with the conclusion drawn from Experiment 1 that the scene information presented in central vision is used earlier than the information presented in the visual periphery.

Footnote 6. Trend analyses show that processing time produced a significant linear trend for accuracy, F(1, 80) = 6.11, p = .016. Additionally, the interaction between processing time and Window/Scotoma conditions on accuracy showed no evidence of a linear trend, F(1, 80) = .04, p = .84.

[[Insert Figure 4 about here]]

[[Insert Table 1 about here]]

Discussion

The results of Experiment 2 suggest that during the first 100 ms of scene viewing, information presented in central vision produced better scene gist recognition than that presented in peripheral vision. The results of Experiment 1 showed that performance for scene information presented in either central or peripheral vision was equal by 94 ms SOA; however, Experiment 2 showed that the central vision advantage was in fact present over the entire first 100 ms. This finding is consistent with previous research showing that in the early processing stages of an eye fixation, information processed by central vision is important for reading (Rayner, et al., 1981; Rayner, et al., 2006; Rayner, et al., 2003), visual search in scenes, and scene memory (Glaholt, et al., 2012; van Diepen & d'Ydewalle, 2003), whereas information in peripheral vision becomes increasingly important later in a fixation. Such results are therefore consistent with the sequential model of attention and the zoom-out hypothesis as applied to scene gist recognition, in which focal attention starts centrally and expands over time to encompass information in peripheral vision. Both explain the overall better performance for central vision during the first 100 ms of scene processing and the converging performance between information presented to central and peripheral vision over time.
Thus, these results are also consistent with the idea that attention does indeed affect scene categorization and scene gist over the critically important first 100 ms of processing. Nevertheless, we draw this conclusion cautiously since we did not directly manipulate attention in either Experiment 1 or 2. Instead, we manipulated processing time and the availability of task-relevant information as a function of retinal eccentricity. Experiment 3 addressed this issue by directly manipulating viewers’ attention while they carried out the scene categorization task in either the Window or Scotoma conditions.
Publication date: 2013